Introduction

I am exploring the white wine data set to find interesting patterns and to expore the most important variables in predicting subjective wine quality.

I decided not to research known relationships between physicochemical properties or attempts by others at predicting wine quality, but rather let the data speak for itself, in order to minimize bias. I will develop some models for predicting quality and at a later stage compare my findings with those of the original paper on this dataset by Cohen et al.

First I am going to look at the structure of the data, the data types and some basic descriptive statistics. I want to check for any missing data or other data quality issues.

Options have been set, now load packages.

Loading the helper functions. Used references [6], [7] and [8] in constructing these.

Data used for this analysis

The data is contained in a file “wineQualityWhites.csv”.

This dataset is public, available for research. The details are described in:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

There are 11 physicochemical variables recorded from tests.

1 - fixed acidity (tartaric acid - g / dm^3)

2 - volatile acidity (acetic acid - g / dm^3)

3 - citric acid (g / dm^3)

4 - residual sugar (g / dm^3)

5 - chlorides (sodium chloride - g / dm^3

6 - free sulfur dioxide (mg / dm^3)

7 - total sulfur dioxide (mg / dm^3)

8 - density (g / cm^3)

9 - pH

10 - sulphates (potassium sulphate - g / dm^3)

11 - alcohol (% by volume)

(Any information below on the physicochemical variables is from the Udacity project data description).

Related to each wine is an evaluation of the quality which is a score between 0 and 10. There are at least three ratings and the median of these is taken (I assume an odd number of ratings as there are only whole integers).

I want to find out any relationships between the physicochemical variables themselves and between them and the quality, and then the relative importance of the variables in a potential prediction model.

Loading the data and looking at the structure.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Calculate statistics for each variable.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

X just seems to be an identifier giving the row/observation number. The only other integer is the quality, which is known to be on an ordinal scale from 0 to 10.

The X column can be dropped.

The quality column has a data type of int, but it might be better analyzed as an ordered category. To get the best of both worlds, add a version of the quality column that is ordered category.

There is at least one value where citric acid has a value of zero. This might be actually zero or a trace amount rounded down to zero. Check how many rows that is. If not too many, take a look at them.

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 116            5.5            0.485           0            1.5     0.065
## 301            6.1            0.410           0            1.6     0.063
## 303            6.1            0.410           0            1.6     0.063
## 781            6.0            0.395           0            1.4     0.042
## 863            4.8            0.340           0            6.5     0.028
## 865            4.8            0.330           0            6.5     0.028
## 891            5.0            0.310           0            6.4     0.046
## 1153           6.1            0.600           0            1.3     0.042
## 1818           7.2            0.500           0            0.8     0.034
## 2322           4.6            0.445           0            1.4     0.053
## 2630           5.8            0.600           0            1.3     0.044
## 3455           5.8            0.540           0            1.4     0.033
## 3572           5.9            0.655           0            5.6     0.033
## 3636           5.8            0.580           0            1.5     0.020
## 4299           5.6            0.260           0           10.2     0.038
## 4598           6.7            0.660           0           13.0     0.033
## 4780           6.0            0.590           0            0.8     0.037
## 4793           4.7            0.785           0            3.4     0.036
## 4878           5.9            0.540           0            0.8     0.032
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 116                    8                  103 0.99400 3.63      0.40
## 301                   36                   87 0.99140 3.27      0.67
## 303                   36                   87 0.99140 3.27      0.67
## 781                    7                   55 0.99135 3.37      0.38
## 863                   33                  163 0.99390 3.36      0.61
## 865                   34                  163 0.99370 3.35      0.61
## 891                   43                  166 0.99400 3.30      0.63
## 1153                  24                   79 0.99370 3.31      0.38
## 1818                  46                  114 0.99320 3.19      0.34
## 2322                  11                  178 0.99426 3.79      0.55
## 2630                  72                  197 0.99202 3.56      0.43
## 3455                  40                  107 0.98918 3.26      0.35
## 3572                   8                   31 0.99360 3.32      0.51
## 3636                  33                   96 0.98918 3.29      0.38
## 4299                  13                  111 0.99315 3.44      0.46
## 4598                  32                   75 0.99551 3.15      0.50
## 4780                  30                   95 0.99032 3.10      0.40
## 4793                  23                  134 0.98981 3.53      0.92
## 4878                  12                   82 0.99286 3.25      0.36
##      alcohol quality quality.cat
## 116      9.7       4           4
## 301     10.8       6           6
## 303     10.8       6           6
## 781     11.2       4           4
## 863      9.9       6           6
## 865      9.9       5           5
## 891      9.9       6           6
## 1153     9.4       4           4
## 1818     9.2       4           4
## 2322    10.2       5           5
## 2630    10.9       5           5
## 3455    12.4       5           5
## 3572    10.5       4           4
## 3636    12.4       6           6
## 4299    12.4       6           6
## 4598    10.7       6           6
## 4780    10.9       4           4
## 4793    13.8       6           6
## 4878     8.8       5           5

There is nothing about the other values that suggests this is a special group of some kind, so I’ll simply accept the values are valid: no citric acid.

Back to the set of statistics. There are some large maximum values compared to the medians: residual.sugar, chlorides, free.sulfur.dioxide, and to a lesser degree total.sulfur.dioxide. I’ll look closer at these in the univariate analysis.

Need to check whether there are any missing values.

##        fixed.acidity     volatile.acidity          citric.acid 
##                    0                    0                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    0                    0 
## total.sulfur.dioxide              density                   pH 
##                    0                    0                    0 
##            sulphates              alcohol              quality 
##                    0                    0                    0 
##          quality.cat 
##                    0

Nothing missing, so on with the plotting.

Univariate Plots Section

To get a feel for the values, create a view of the first rows.

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality quality.cat
## 1       6           6
## 2       6           6
## 3       6           6
## 4       6           6
## 5       6           6
## 6       6           6

I can use the granularity shown to zoom in on the plots. I will make a quick visualization with the helper function, then zoom in. I will try to see the finest level of detail to check for patterns.

Fixed acidity

Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The tiny spikes in between the values at 0.1 granularity indicate some values have another decimal place. Look at the individual values.

## 
##  3.8  3.9  4.2  4.4  4.5  4.6  4.7  4.8  4.9    5  5.1  5.2  5.3  5.4  5.5 
##    1    1    2    3    1    1    5    9    7   24   23   28   27   28   31 
##  5.6  5.7  5.8  5.9    6  6.1 6.15  6.2  6.3  6.4 6.45  6.5  6.6  6.7  6.8 
##   71   88  121  103  184  155    2  192  188  280    1  225  290  236  308 
##  6.9    7  7.1 7.15  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9    8  8.1  8.2 
##  241  232  200    2  206  178  194  123  153   93   93   74   80   56   56 
##  8.3  8.4  8.5  8.6  8.7  8.8  8.9    9  9.1  9.2  9.3  9.4  9.5  9.6  9.7 
##   52   35   32   25   15   18   16   17    6   21    3   11    2    5    4 
##  9.8  9.9   10 10.2 10.3 10.7 11.8 14.2 
##    8    2    3    1    2    2    1    1

There are a few x.x5 values that fall midway between two x.x values, e.g. 6.1, 6.15, 6.2. The frequencies of even numbers are higher in general than those of odd numbers and the x.x5 values do not make up the difference. So there are some rounding effects here or some effect of the equipment output. If this occurs in other variables it won’t be commented on further. There is no reason to standardize/round the values, as this would just remove information. Decided to leave them as they are.

Volatile acidity

The amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste. But we don’t know what “too high” is.

Only a slight skew to the right. View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Another possibility is to examine the ratio of fixed to volatile acidity.

Citric acid

Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

It seems that there is a much larger number of one value than its position in the distribution would expect. Again, let’s look at the values.

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##   19    7    6    2   12    5    6   12    4   12   14    1   19   17   27 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   23   33   27   49   48   70   66  104   83  181  136  219  216  282  223 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##  307  200  257  183  225  137  177  134  122  101  117   82   95   37   63 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   46   51   38   39  215   35   25   23   16   19   11   22   13   21    6 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    6    9   14    4    6    8    7    7    7    5    3    9    5    5   41 
## 0.78 0.79  0.8 0.81 0.82 0.86 0.88 0.91 0.99    1 1.23 1.66 
##    2    2    2    2    2    1    1    2    1    5    1    1

So there are 215 values of 0.49 compared to neighbours of 39 and 35. That’s an anomaly that merits investigation. Filter out these values and check the distribution of the other variables.

##  fixed.acidity    volatile.acidity  citric.acid   residual.sugar  
##  Min.   : 5.600   Min.   :0.0800   Min.   :0.49   Min.   : 0.900  
##  1st Qu.: 6.800   1st Qu.:0.2000   1st Qu.:0.49   1st Qu.: 1.500  
##  Median : 7.400   Median :0.2500   Median :0.49   Median : 5.000  
##  Mean   : 7.489   Mean   :0.2629   Mean   :0.49   Mean   : 5.793  
##  3rd Qu.: 8.000   3rd Qu.:0.3000   3rd Qu.:0.49   3rd Qu.: 8.100  
##  Max.   :14.200   Max.   :0.8500   Max.   :0.49   Max.   :23.500  
##                                                                   
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.02700   Min.   : 3.00       Min.   : 18.0       
##  1st Qu.:0.03600   1st Qu.:21.50       1st Qu.:113.5       
##  Median :0.04400   Median :32.00       Median :138.0       
##  Mean   :0.04558   Mean   :33.61       Mean   :141.2       
##  3rd Qu.:0.05100   3rd Qu.:45.00       3rd Qu.:164.0       
##  Max.   :0.23900   Max.   :87.00       Max.   :247.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9893   Min.   :2.850   Min.   :0.2700   Min.   : 8.50  
##  1st Qu.:0.9928   1st Qu.:3.065   1st Qu.:0.3750   1st Qu.: 9.70  
##  Median :0.9940   Median :3.140   Median :0.4500   Median :10.50  
##  Mean   :0.9943   Mean   :3.163   Mean   :0.4623   Mean   :10.48  
##  3rd Qu.:0.9956   3rd Qu.:3.240   3rd Qu.:0.5300   3rd Qu.:11.20  
##  Max.   :1.0024   Max.   :3.650   Max.   :0.9800   Max.   :13.00  
##                                                                   
##     quality      quality.cat
##  Min.   :4.000   3:  0      
##  1st Qu.:5.000   4:  9      
##  Median :6.000   5: 54      
##  Mean   :5.893   6:110      
##  3rd Qu.:6.000   7: 36      
##  Max.   :9.000   8:  5      
##                  9:  1
## [1] "Count: 215 of 4898, Share: 4.39%"

Nothing stands out from scanning the data. Perhaps there was for some wines a carefully controlled process for exactly this level of citric acid.

Residual sugar

The amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet. The values are in grams per cubic decimeter, which is the same as grams per liter. How many sweet wines are there by this definition?

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 2782           7.8            0.965         0.6           65.8     0.074
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 2782                   8                  160 1.03898 3.39      0.69
##      alcohol quality quality.cat
## 2782    11.7       6           6

Just one wine with a value of 65.8 would be considered sweet by this definition, which seems improbable but may have something to do with how the wines for this data set were selected. According to Wikipedia, in the EU the following thresholds apply: dry: < 4g/l medium dry 4-12g/l medium 12-45g/l sweet > 45g/l

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

This is quite different from the other distributions, that have a more heaped and symmetrical distribution. There are some higher frequencies around 1 to 2.

It has the characteristics of a value that is being targeted, rather than a natural random variation. Or perhaps multiple distributions overlayed on each other, like particular brands of wine having ranges of sweetness.

It is rare to find wines with less than 1. How many of those are there?

##  fixed.acidity   volatile.acidity  citric.acid     residual.sugar  
##  Min.   :4.500   Min.   :0.1200   Min.   :0.0000   Min.   :0.6000  
##  1st Qu.:5.900   1st Qu.:0.2000   1st Qu.:0.2700   1st Qu.:0.8000  
##  Median :6.500   Median :0.2400   Median :0.3100   Median :0.9000  
##  Mean   :6.716   Mean   :0.2857   Mean   :0.2992   Mean   :0.8442  
##  3rd Qu.:7.200   3rd Qu.:0.3300   3rd Qu.:0.3600   3rd Qu.:0.9000  
##  Max.   :9.200   Max.   :0.9050   Max.   :0.7400   Max.   :0.9500  
##                                                                    
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01900   Min.   : 3.00       Min.   : 21.00      
##  1st Qu.:0.03200   1st Qu.:12.00       1st Qu.: 72.00      
##  Median :0.03700   Median :22.00       Median : 95.00      
##  Mean   :0.03984   Mean   :25.69       Mean   : 98.53      
##  3rd Qu.:0.04200   3rd Qu.:36.00       3rd Qu.:126.00      
##  Max.   :0.18500   Max.   :89.00       Max.   :204.00      
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9885   Min.   :2.800   Min.   :0.2800   Min.   : 8.00  
##  1st Qu.:0.9903   1st Qu.:3.080   1st Qu.:0.3500   1st Qu.: 9.80  
##  Median :0.9916   Median :3.210   Median :0.3900   Median :10.60  
##  Mean   :0.9914   Mean   :3.187   Mean   :0.4181   Mean   :10.58  
##  3rd Qu.:0.9925   3rd Qu.:3.280   3rd Qu.:0.4700   3rd Qu.:11.40  
##  Max.   :0.9936   Max.   :3.540   Max.   :0.7700   Max.   :13.10  
##                                                                   
##     quality      quality.cat
##  Min.   :3.000   3: 1       
##  1st Qu.:5.000   4:11       
##  Median :5.000   5:29       
##  Mean   :5.377   6:31       
##  3rd Qu.:6.000   7: 4       
##  Max.   :8.000   8: 1       
##                  9: 0
## [1] "Count: 77 of 4898, Share: 1.57%"

77 out of 4898 can be considered rare. Ca. 1.6% seems ok.

Maybe this data doesn’t follow a more “normal” distribution because of attempts to create a low residual sugar and keep the wine dry. It might be a candidate for transformation, but it seems more like two superimposed distributions than one skewed distribution. One distribution below 20, superimposed with one distribution below 2.

Chlorides

The amount of salt in the wine.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

A well-balanced distribution with a few large outliers.

Free sulfur dioxide (SO2)

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Again, fairly well balanced, with a few large outliers.

How many wines have a concentration over 50 ppm? I used reference [2] to check the conversion to the units of the provided data. The conversion is 1:1.

In relation to the base unit of density (kilograms per cubic meter):

1 Part Per Million (ppm) = 0.001 kilograms-per-cubic-meter

1 Milligram Per Cubic Decimeter (mg/dm3) = 0.001 kilograms-per-cubic-meter.

##  fixed.acidity   volatile.acidity  citric.acid     residual.sugar  
##  Min.   :4.200   Min.   :0.1050   Min.   :0.0000   Min.   : 0.900  
##  1st Qu.:6.400   1st Qu.:0.2200   1st Qu.:0.2700   1st Qu.: 5.675  
##  Median :6.700   Median :0.2600   Median :0.3400   Median : 9.200  
##  Mean   :6.804   Mean   :0.2716   Mean   :0.3652   Mean   : 9.027  
##  3rd Qu.:7.300   3rd Qu.:0.3100   3rd Qu.:0.4400   3rd Qu.:12.600  
##  Max.   :9.600   Max.   :0.6950   Max.   :1.2300   Max.   :23.500  
##                                                                    
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   : 50.5       Min.   : 97.0       
##  1st Qu.:0.04075   1st Qu.: 54.0       1st Qu.:155.0       
##  Median :0.04600   Median : 58.0       Median :178.0       
##  Mean   :0.05021   Mean   : 61.5       Mean   :179.8       
##  3rd Qu.:0.05100   3rd Qu.: 64.0       3rd Qu.:201.2       
##  Max.   :0.34600   Max.   :289.0       Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol      
##  Min.   :0.9874   Min.   :2.860   Min.   :0.2500   Min.   : 8.000  
##  1st Qu.:0.9942   1st Qu.:3.080   1st Qu.:0.4200   1st Qu.: 9.200  
##  Median :0.9959   Median :3.160   Median :0.4800   Median : 9.500  
##  Mean   :0.9957   Mean   :3.169   Mean   :0.4999   Mean   : 9.861  
##  3rd Qu.:0.9976   3rd Qu.:3.230   3rd Qu.:0.5600   3rd Qu.:10.400  
##  Max.   :1.0024   Max.   :3.760   Max.   :0.9900   Max.   :13.500  
##                                                                    
##     quality      quality.cat
##  Min.   :3.000   3:  5      
##  1st Qu.:5.000   4: 14      
##  Median :6.000   5:349      
##  Mean   :5.705   6:390      
##  3rd Qu.:6.000   7: 85      
##  Max.   :9.000   8: 24      
##                  9:  1
## [1] "Count: 868 of 4898, Share: 17.72%"

So ca. 18% may have a noticeable bad odour.

It would be interesting to see if these wines are rated lower compared to the rest.

## Up to 50 has mean: 5.915136 Above 50 has mean: 5.705069

Which does confirm the idea that above 50 there is a noticeable bad odour, so quality ratings are lower. Could run a test to confirm the difference is significant, but not now. The counts are large enough to support the idea.

Total sulfur dioxide (SO2)

Amount of free and bound forms of SO2.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Mostly symmetrical with a slight skew. No real need to transform.

New variable: Bound sulfur dioxide

Since Total Sulfur Dioxide includes Free and Bound forms a new variable can be created which is: Bound Sulfur Dioxide = Total Sulfur Dioxide - Free Sulfur Dioxide.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

This variable could be used together with Free Sulfur Dioxide. Another possibility is to examine the proportion of free to bound sulfur dioxide as a variable.

Density

The density of wine is close to that of water, depending on the percent alcohol and sugar content.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Again, there are some values that were probably already rounded and so occur more frequently.

pH

Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Compared to the description of “most wines are between 3 and 4”, this distribution is somewhat further towards the 3 value, but otherwise no particular anomalies.

Sulphates

A wine additive which can contribute to sulfur dioxide gas (S02) levels, which acts as an antimicrobial and antioxidant.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Mostly symmetrical with a slight skew. No real need to transform. The appearance of being bimodal is due to an unusually high number of 0.38 results. There is a higher number of 0.50 results which may be due to some rounding.

Alcohol

The percent alcohol content of the wine. Wine tends to be between 9 and 16%.

View a summary of the statistics.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

There seem to be two or three modes in the distribution. This may be caused by attempts to achieve a particular alcohol content. E.g. around 9.5, 10.5 and 12.5. Some peaks may also be due to rounding, e.g. 10.0, 10.5, 11.0, 12.0, 12.5.

Like residual sugar, it has the characteristics of a value that is being targeted, rather than a natural random variation.

Check the relationship between alcohol content and residual sugar. I believe that high alcohol corresponds with low residual sugar because more is used in the fermentation.

This is a distribution without outliers, which is probably good for the customers!

Quality

The subjective rated quality of the wine. The quality rating is the median of three independent ratings on a scale of 0 to 10.

Confirm that the ratings are whole integers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Looking at the frequency of each category there are only 5 wines behind the 9 rating and 20 behind the 3 rating. Those are small sample sizes.

Visualize those frequencies in a bar chart.

There are no 0s, 1s, 2s, or 10s. These are extremely bad or good ratings and may also be missing due to the way in which the median of the scores of at least 3 assessors are used. So it is hard to get 0 or 10.

The distribution seems well-balanced with a range of values covering all usual quality levels, which will help find relationships between quality and the other variables. The mode (and median) is 6, so slightly in favour of better wines, but no issue.

Univariate Analysis

The structure of the dataset

There are 11 continuous variables representing measurements of various physicochemical properties, plus a variable representing the subjective assessment of the wine quality (using an ordinal discrete scale from 0 to 10). There is an additional variable (X) that is simply a row index and so was dropped.

There are 4898 rows of data, each representing a wine that was measured and assessed.

The main feature(s) of interest

The quality variable is most important as the intention of the dataset is to enable quality to be predicted through the other variables.

All the other variables are potential explanatory variables. Most of the variables show a somewhat gaussian distribution, which might imply that these are not particularly “steered” during the winemaking process.

Of main interest would be those 7 explanatory variables that intuitively most likely influence taste: - Residual Sugar i.e. sweetness. - Alcohol i.e. strength of alcohol. - Volatile Acidity i.e. vinegar-like taste. - Citric Acid i.e. freshness and flavor. - Free Sulfur Dioxide (evident in “nose” and taste of wine at high concentrations). - Chlorides i.e. saltiness.

Other features that will support investigation of feature(s) of interest?

The other 4 features are of secondary interest, but the more extreme values need to be considered. Total Sulfur Dioxide includes “Free” and “Bound” forms, but intuitively it is the free forms that impact taste and smell. The bound forms were calculated as an additional variable. Similarly, volatile acidity intuitively affects taste and smell more than fixed acidity.

New variables created from existing variables

Since Total Sulfur Dioxide includes Free and Bound forms a new variable was created which was:

Bound Sulfur Dioxide = Total Sulfur Dioxide - Free Sulfur Dioxide.

Unusual distributions

Residual Sugar and Alcohol show distributions that might be a result of specific interventions to achieve a specific classification. For example, residual sugar determines whether the wine can be labeled “dry”, “sweet”, etc.

Operations to tidy, adjust, or change the form of the data

Although there were a few extreme values in the data, these were left in for now. None of the distributions suggested there would be an advantage in transformations.

Bivariate Plots Section

For the rest of the analysis, shorten the column names. They are:

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "quality.cat"          "bound.sulfur.dioxide"

to be shortened to:

##  [1] "fixed"       "volatile"    "citric"      "sugar"       "chlorides"  
##  [6] "free.so2"    "total.so2"   "density"     "pH"          "sulphates"  
## [11] "alcohol"     "quality"     "quality.cat" "bound.so2"

Lets’s look in detail at each relationship with quality.

Relationships with main output feature (quality or quality.cat)

The following relationships are interesting: quality.cat and each of the explanatory variables.

For each variable, look at the mean and distribution via a box plot, and if necessary, zoom in. Also look at summary statistics.

Alcohol

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3           10.3     10.9      9.77   10.4
## 2 4           10.2     10.3     10.00   10.1
## 3 5            9.81     9.85     9.77    9.5
## 4 6           10.6     10.6     10.5    10.5
## 5 7           11.4     11.5     11.3    11.4
## 6 8           11.6     11.8     11.4    12  
## 7 9           12.2     13.4     10.9    12.5

Alcohol has a clear relationship with quality. The ‘dip’ in the middle is interesting. Maybe there are relationships for ‘above average’ and ‘below average’ quality wines and both involve an increase in alcohol away from the average.

Sulphates

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3           0.475    0.531    0.418   0.44
## 2 4           0.476    0.494    0.458   0.47
## 3 5           0.482    0.487    0.477   0.47
## 4 6           0.491    0.496    0.486   0.48
## 5 7           0.503    0.512    0.494   0.48
## 6 8           0.486    0.508    0.464   0.46
## 7 9           0.466    0.581    0.351   0.46

There is a linear relationship from 5 to 7 not visible in the medians, but otherwise uncertain.

pH

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3            3.19     3.29     3.09   3.22
## 2 4            3.18     3.21     3.16   3.16
## 3 5            3.17     3.18     3.16   3.16
## 4 6            3.19     3.19     3.18   3.18
## 5 7            3.21     3.22     3.20   3.2 
## 6 8            3.22     3.24     3.20   3.23
## 7 9            3.31     3.41     3.21   3.28

A linear relationship from 5 to 7, uncertain with other categories.

Density

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3           0.995    0.996    0.994  0.994
## 2 4           0.994    0.995    0.994  0.994
## 3 5           0.995    0.995    0.995  0.995
## 4 6           0.994    0.994    0.994  0.994
## 5 7           0.992    0.993    0.992  0.992
## 6 8           0.992    0.993    0.992  0.992
## 7 9           0.991    0.995    0.988  0.990

Zoom in on the box plot.

This is a clear linear relationship for 5 and above, slightly disrupted by category 4.

Free sulfur dioxide (SO2)

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3            53.3     85.8     20.8   33.5
## 2 4            23.4     26.5     20.2   18  
## 3 5            36.4     37.4     35.5   35  
## 4 6            35.7     36.3     35.0   34  
## 5 7            34.1     35.0     33.2   33  
## 6 8            36.7     39.1     34.3   35  
## 7 9            33.4     50.1     16.7   28

Zoom in on the box plot.

No significant relationship but interesting that low levels relate to category 4, because it is the high levels of free so2 that are supposed to be detectable.

Bound sulfur dioxide (SO2)

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3           117.     150.      84.8    106
## 2 4           102.     108.      95.4    102
## 3 5           114.     116.     113.     114
## 4 6           101.     103.     100.0     97
## 5 7            91.0     92.8     89.1     86
## 6 8            89.4     93.4     85.5     84
## 7 9            82.6    110.      55.3     82

A similar pattern to density, again with category 4 not agreeing with the linear relationship.

Total sulfur dioxide (SO2)

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3            171.     221.    120.    160.
## 2 4            125.     133.    117.    117 
## 3 5            151.     153.    149.    151 
## 4 6            137.     139.    135.    132 
## 5 7            125.     127.    123.    122 
## 6 8            126.     131.    121.    122 
## 7 9            116      141.     91.4   119

Since free sulfur dioxide showed no particular pattern over the quality categories, the pattern here in total is driven by the bound part. So total and bound look similar and are highly correlated.

Chlorides

## # A tibble: 7 x 5
##   quality.cat   mean upper_ci lower_ci median
##   <fct>        <dbl>    <dbl>    <dbl>  <dbl>
## 1 3           0.0543   0.0760   0.0326  0.041
## 2 4           0.0501   0.0541   0.0461  0.046
## 3 5           0.0515   0.0529   0.0502  0.047
## 4 6           0.0452   0.0461   0.0444  0.043
## 5 7           0.0382   0.0389   0.0375  0.037
## 6 8           0.0383   0.0403   0.0364  0.036
## 7 9           0.0274   0.0366   0.0182  0.031

Zoom in on the box plot.

Chlorides show a clear linear relationship with quality.

Residual sugar

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3            6.39     8.88    3.90    4.6 
## 2 4            4.63     5.27    3.98    2.5 
## 3 5            7.33     7.61    7.06    7   
## 4 6            6.44     6.66    6.23    5.3 
## 5 7            5.19     5.47    4.90    3.65
## 6 8            5.67     6.31    5.04    4.3 
## 7 9            4.12     8.79   -0.548   2.2

Zoom in on the box plot.

Again, some linear relationship from 5 to 7, disrupted by category 4 and to some extent 8.

Citric acid

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3           0.336    0.374    0.298  0.345
## 2 4           0.304    0.330    0.279  0.290
## 3 5           0.338    0.345    0.330  0.32 
## 4 6           0.338    0.343    0.333  0.32 
## 5 7           0.326    0.331    0.320  0.31 
## 6 8           0.327    0.339    0.314  0.32 
## 7 9           0.386    0.488    0.284  0.36

Zoom in on the box plot.

The differences between categories are minimal and 4 again is an exception.

Volatile acidity

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3           0.333    0.399    0.267   0.26
## 2 4           0.381    0.408    0.354   0.32
## 3 5           0.302    0.307    0.297   0.28
## 4 6           0.261    0.264    0.257   0.25
## 5 7           0.263    0.269    0.257   0.25
## 6 8           0.277    0.294    0.261   0.26
## 7 9           0.298    0.370    0.226   0.27

Again an interesting “dip” around 6 and 7. Given the confidence intervals, it could be argued that for 6 and above higher quality is not related to lower volatile acidity, but higher volatile acidity does relate to lower quality ratings. This could be an example where the effect of a variable is non-linear - high levels lead to poor quality, low levels do not lead to a higher quality. This might partition the results at around 0.26 volatile acidity.

Fixed acidity

## # A tibble: 7 x 5
##   quality.cat  mean upper_ci lower_ci median
##   <fct>       <dbl>    <dbl>    <dbl>  <dbl>
## 1 3            7.6      8.41     6.79    7.3
## 2 4            7.13     7.30     6.96    6.9
## 3 5            6.93     6.98     6.89    6.8
## 4 6            6.84     6.87     6.80    6.8
## 5 7            6.73     6.78     6.68    6.7
## 6 8            6.66     6.78     6.53    6.8
## 7 9            7.42     8.64     6.20    7.1

This shows a clear linear relationship with quality.

An overview of just the correlations would be helpful. Let’s use corrgram.

The variables have been “effect-ordered” using principal component analysis of the correlation matrix, such that variables correlating are clustered together. See reference [4].

There are strong correlations between sugar and density (positive) and between alcohol and density (negative). The relationships between these three need to be explored.

Alcohol has the strongest positive correlation with quality: more is better. Most other variables have a negative correlation with quality: less is better.

There is a cluster of positive correlation around total so2, bound so2, sugar and density. Total so2 and the derived variable bound so2 are very strongly correlated with each other so that for the models I must choose between them.

I decided to continue with total.so2 since I can then compare the results of any models with results from others. It might be interesting in the future to see whether using the bound variable would improve model results.

Check interesting relationships between the original explanatory variables.

##                 fixed    volatile      citric       sugar   chlorides
## fixed      1.00000000 -0.02269729  0.28918070  0.08902070  0.02308564
## volatile  -0.02269729  1.00000000 -0.14947181  0.06428606  0.07051157
## citric     0.28918070 -0.14947181  1.00000000  0.09421162  0.11436445
## sugar      0.08902070  0.06428606  0.09421162  1.00000000  0.08868454
## chlorides  0.02308564  0.07051157  0.11436445  0.08868454  1.00000000
## free.so2  -0.04939586 -0.09701194  0.09407722  0.29909835  0.10139235
## total.so2  0.09106976  0.08926050  0.12113080  0.40143931  0.19891030
## density    0.26533101  0.02711385  0.14950257  0.83896645  0.25721132
## pH        -0.42585829 -0.03191537 -0.16374821 -0.19413345 -0.09043946
## sulphates -0.01714299 -0.03572815  0.06233094 -0.02666437  0.01676288
## alcohol   -0.12088112  0.06771794 -0.07572873 -0.45063122 -0.36018871
##                free.so2    total.so2     density            pH   sulphates
## fixed     -0.0493958591  0.091069756  0.26533101 -0.4258582910 -0.01714299
## volatile  -0.0970119393  0.089260504  0.02711385 -0.0319153683 -0.03572815
## citric     0.0940772210  0.121130798  0.14950257 -0.1637482114  0.06233094
## sugar      0.2990983537  0.401439311  0.83896645 -0.1941334540 -0.02666437
## chlorides  0.1013923521  0.198910300  0.25721132 -0.0904394560  0.01676288
## free.so2   1.0000000000  0.615500965  0.29421041 -0.0006177961  0.05921725
## total.so2  0.6155009650  1.000000000  0.52988132  0.0023209718  0.13456237
## density    0.2942104109  0.529881324  1.00000000 -0.0935914935  0.07449315
## pH        -0.0006177961  0.002320972 -0.09359149  1.0000000000  0.15595150
## sulphates  0.0592172458  0.134562367  0.07449315  0.1559514973  1.00000000
## alcohol   -0.2501039415 -0.448892102 -0.78013762  0.1214320987 -0.01743277
##               alcohol
## fixed     -0.12088112
## volatile   0.06771794
## citric    -0.07572873
## sugar     -0.45063122
## chlorides -0.36018871
## free.so2  -0.25010394
## total.so2 -0.44889210
## density   -0.78013762
## pH         0.12143210
## sulphates -0.01743277
## alcohol    1.00000000

Relationships between supporting (explanatory) variables

The following relationships are interesting (r is the correlation coefficient for complete dataset): 1. density and residual sugar (r = 0.84) 2. alcohol and density (r = -0.78) 3. total.so2 and density (r = 0.53) 4. alcohol and residual sugar (r = -0.45) 5. total.so2 and alcohol (r = -0.45) 6. pH and fixed.acidity (r = -0.43)

Since we now have a reasonable number of factors (6), let’s use pairs in the psych package to look at them.

The lower part shows bivariate 68% concentration ellipses and loess smoothing curves so we can see where the most data lies and how much slope in that region.

pH does not correlate with any variables except fixed acidity and that might be because pH itself is a scale of base vs acidity.

For each combination, we’ll generate two plots: one with the full data and one with a random 500 sample.

1. density and residual sugar (r = 0.84)

There is a rather extreme outlier in the data.

##   fixed volatile citric sugar chlorides free.so2 total.so2 density   pH
## 1   7.8    0.965    0.6  65.8     0.074        8       160 1.03898 3.39
##   sulphates alcohol quality quality.cat
## 1      0.69    11.7       6           6

Leave it in the dataset for now on the assumption that all values are valid and verified. Given the size of the dataset and that the outlier is not extremely far away from the rest of the ‘cloud’ (not necessarily a corrupt value), it shouldn’t skew the values too much and is in category 6 where there is plenty of other data to reduce its effect.

There is a concentration ‘band’ of values at the lower end of the residual sugar range, just above zero. So, regardless of density, some effect causes these to be more frequent.

2. alcohol and density (r = -0.78)

Strong linear relationship. The large outlier has already been examined.

3. total so2 and density (r = 0.50)

Moderate linear relationship. The large outlier has already been examined.

4. alcohol and residual sugar (r = -0.45)

This is a less natural relationship as there is banding at the lower levels of sugar as itapproaches zero. There may be a limit in how much sugar can be reduced. There are also many cases where both low alcohol and low sugar occur.

5. total so2 and alcohol (r = -0.45)

Moderate linear relationship with some asymmetry of the ‘cloud’.

6. pH and fixed acidity (r = -0.43)

Moderate linear relationship.

Bivariate Analysis

Relationships observed: feature of interest vs. other features

Of the explanatory variables, 9 show a clear relationship with the quality categories. The two that don’t were: free so2 and citric acid. By “clear relationship” I mean I can see a consistent relationship with some of the categories and by looking at the confidence intervals rather than the mean, I can imagine the relationship holding if the sample size were larger (of course a confidence interval is not a prediction interval!). This visual approximation was made at the upper and lower categories (3,4,8,9), which include fewer wines. Looking at the frequencies again:

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

There are only 5 wines in the 9 category and 20 in the 3 category.

Volatile acidity had a clear relationship for categories up to 6, then leveled out.

There were two effects that stood out during the investigation: 1. The quality category 4 results seem to be an exception for many of the variable relationships. This occurred for: alcohol, residual sugar, density, free so2, total so2, and citric acid. 2. There is a ‘band’ of values around the lower levels of residual sugar (nearing zero), which means they occurred more frequently in the data than the other residual sugar values. We already saw that residual sugar and alcohol have distributions that suggest an intervention taking place, either in the making of the wine or in the selection process of the dataset. There may have been some stratification in the selection process that ensures certain segments of wine were represented. There is no information on this available.

Interesting relationships between the other features

The scatterplot of total so2 and alcohol showed an asymmetrical regression line through the cloud. This can be investigated further in the multivariate section.

The strongest relationship

Visually, the strongest relationship with the quality categories was with alcohol and total sulfur dioxide. Between the explanatory variables, the strongest relationship was between density and residual sugar, closely followed by density and alcohol.

The relations between density, alcohol and residual sugar can be investigated further in the multivariate section.

Multivariate Plots Section

Pick a color-blind friendly color scheme for coding the quality categories. Red means poor quality, blue means good quality. The 7 categories are thus:

So average ratings are less intense colors, more divergent for good (blue) and poor (red) ratings have the more intense colors.

First look at alcohol and density. Since presence of alcohol lowers the density, put density on the y-axis.

The improvement of quality with reduced density and increased alcohol can be seen. Otherwise there is nothing particularly stratified about the pattern, which means that alcohol is decisive for quality perception, not density. One could argue, that there is a tendency for high quality wines (8 or 9) to be above the regression line, but it’s a small effect.

Alcohol should partition the data well.

Now look at alcohol and sugar. Sugar is fermented into alcohol, so put alcohol on the y-axis.

Poor quality is common for all residual sugar levels if the alcohol is low enough. I would expect the lower left corner to be missing, if high alcohol means low sugar. The converse is not seen: high alcohol and high residual sugar.

Again, alcohol should partition the data well. Or an interaction of alcohol and sugar.

Lets now investigate further the residual sugar-alcohol-density relationship to see whether residual sugar or alcohol as more influence on density.

Here we see the relationship between alcohol and residual sugar gradually decreasing as the density increases. So sugar doesn’t affect density as much as alcohol, or there would be more slope in the lines with low alcohol/high density. Let’s swap alcohol and density.

Here we see the same relationship between alcohol and density holds consistently at all sugar levels.

I want to look at an example where category 4 did not follow the pattern of the other categories. The biggest disruptive effect was with residual sugar, alcohol, density and total so2. Since residual sugar, density and alcohol are related, I choose residual sugar (y) and total so2 (x). I’m going to facet on quality.

I didn’t see anything in category 4 that looks different from the other categories. Try again with alcohol and density.

Again, no clear reason. So the influences on category 4 are more to do with subtle shifts in distributions, which is picked up by the statistics.

Now let’s look at some of the other variables.

The better ratings tend to be with lower total so2.

I’m curious if free so2 looks any different.

There is a band of free so2 near 0 that seems to be rated particularly poorly. But then the scores with high total so2 are poor. One could split first by one variable then the other, but the total so2 would split more points and so would probably be more important in a partitioning model.

There seem to be better ratings at low fixed acidity. High volatile acidity is not so clear but low seem better. It’s hard to tell which one would partition better.

This one is more interesting. Low chloride levels (less saltiness) are associated with good ratings. There is a cluster around a pH of 3.2 that have much larger chloride levels than the ‘cloud’ and these are almost always rated poorly. Should partition well.

Finally, one of the relationships found earlier: pH and fixed acidity.

In general, high pH corresponds to low acidity, so the relationship makes sense. There is a tendency for good ratings to be with high pH and low fixed acidity.

From the visual analysis, it loks like alcohol, total so2, fixed acidity and chlorides would be important variables.

Multivariate analysis with models

At this stage I am going to look at some simple prediction models to try to find out what the most important variables affecting quality are. I am going to model the quality in two ways. Given the large number of samples, although quality has ordinal discrete values, it is acceptable to treat quality as a continuous variable. For some models, I wlll try to predict the categorical quality values.

I will use a 70%/30% split to generate a random training and test set. The more advanced models with caret will be set up to do n-fold cross-validation using the training set. Where appropriate I will repeat that cross-validation three times. For the continuous models I will look at metrics such as RMSE, R-Squared and MAE, and for the categorical models I will compare Accuracy and Kappa values.

First generate the training and test sets.

Now set up formulae for predicting the continuous and categorical quality.

## quality ~ fixed + volatile + citric + sugar + chlorides + free.so2 + 
##     total.so2 + density + pH + sulphates + alcohol
## quality.cat ~ fixed + volatile + citric + sugar + chlorides + 
##     free.so2 + total.so2 + density + pH + sulphates + alcohol

Using the continuous formula, start simple with a rpart model (Recursive Partitioning And Regression Trees).

First ignore the penalties and see how the complexity parameter (cp) varies with reducing the relative error.

I saw visually the optimum around 5 and added a reference line at that value. Optimization should stop at around 5 nodes. Now let the caret package decide.

It decided on 6. Let’s see what variables it selected.

## 
## Regression tree:
## rpart(formula = formula_continuous, data = treeTrain)
## 
## Variables actually used in tree construction:
## [1] alcohol  free.so2 volatile
## 
## Root node error: 2716.7/3428 = 0.7925
## 
## n= 3428 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.163503      0   1.00000 1.00120 0.025151
## 2 0.048399      1   0.83650 0.84349 0.024057
## 3 0.030740      2   0.78810 0.79671 0.023538
## 4 0.018273      3   0.75736 0.77869 0.022864
## 5 0.010407      4   0.73908 0.75077 0.021718
## 6 0.010000      5   0.72868 0.74924 0.021697

It selected alcohol and free so2 and volatile acidity as the split variables. That’s a bit surprising, but we are making big splits to the data, which are hard to see when viewing the variables two at a time.

Let’s visualize the major tree splits. Plot the left branch of the decision tree.

The first decision is to cover the area below the horizontal line as it has poorer quality. The area to the left of the left vertical line has better quality, then the area between the vertical lines, then the area to the right of the right vertical line.

Now plot the right branch of the decision tree.

The first decision is to cover the area above the lower horizontal line as it has better quality. The area to the left of the vertical line has poorer quality, and for the area to the right of the vertical lines, quality is better above the top horizontal line.

So we can track the decisions roughly on the graphics.

It was interesting that it made a decision based on free so2 and not total so2.

Let’s see how this simple model performs. First make a set of predictions with the test data.

Let’s look at the range of predictions, because the splits so far are distinguishing good/bad areas around the middle. Compare predictions to observations. First the predictions statistics:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.257   5.547   5.926   5.877   6.209   6.616

Then statistics for the observations being predicted (in the test set):

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    5.00    6.00    5.87    6.00    8.00

We are not predicting beyond categories 5 and 6.

##   3   4   5   6   7   8   9 
##   7  53 415 698 243  54   0

3, 4, 8 and 9 may suffer from low samples, but we should at least reach 7.

Calculate the RMSE metric from the observations and predictions.

## [1] 0.7485636

On average, the predictions are off by ca. 0.75, which might seem quite good but the majority of predictions are distinguishing 5 from 6 from 7.

Let’s try tuning the model using the rpart method in caret. We’ll do 10-fold cross-validations for 3 repeats. Instead of tuning over the default 3 values of Cp, we’ll do 30.

## CART 
## 
## 3428 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3086, 3085, 3085, 3085, 3085, 3085, ... 
## Resampling results across tuning parameters:
## 
##   cp           RMSE       Rsquared   MAE      
##   0.002397130  0.7571325  0.2899568  0.5916011
##   0.002436190  0.7569029  0.2901679  0.5915401
##   0.002468413  0.7566983  0.2902438  0.5916062
##   0.002606896  0.7562768  0.2893742  0.5935173
##   0.002683425  0.7559366  0.2894841  0.5933004
##   0.002892613  0.7542470  0.2907379  0.5931311
##   0.003009863  0.7532875  0.2915859  0.5927112
##   0.003090272  0.7535723  0.2907528  0.5937174
##   0.003241063  0.7531557  0.2908492  0.5937972
##   0.003269676  0.7533839  0.2903486  0.5941263
##   0.003430740  0.7541194  0.2884547  0.5951529
##   0.003561024  0.7540492  0.2880643  0.5950557
##   0.003714169  0.7537236  0.2883697  0.5947478
##   0.003737597  0.7540729  0.2877677  0.5947709
##   0.004422305  0.7537121  0.2867248  0.5961655
##   0.004515881  0.7540622  0.2860141  0.5964520
##   0.004709184  0.7553321  0.2835749  0.5974940
##   0.004981479  0.7557710  0.2826124  0.5986878
##   0.005011718  0.7559164  0.2823196  0.5989794
##   0.005653559  0.7579347  0.2781485  0.6010130
##   0.006130020  0.7613270  0.2713719  0.6034467
##   0.006149680  0.7613411  0.2713524  0.6034156
##   0.006351919  0.7624079  0.2690498  0.6045826
##   0.007046163  0.7640770  0.2654774  0.6067103
##   0.007103542  0.7641887  0.2652305  0.6067072
##   0.010407236  0.7671678  0.2588612  0.6120345
##   0.018273243  0.7737322  0.2464031  0.6224251
##   0.030739657  0.7860451  0.2221236  0.6320496
##   0.048398871  0.8027803  0.1887521  0.6520621
##   0.163503257  0.8602109  0.1329783  0.6835411
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.003241063.

Optimal RSME was about 0.75, in line with the rpart package predictions.

Visualize the model.

Splits at the higher level are very similar to the rpart model. Again, major decision points are with alcohol, volatile acidity and free so2.

Plot the importance of the variables as determined by the model.

Chlorides ranks second on importance but only appears once on the tree and only very low down. Volatile is also ranked lower than expected. Obviously variables used for splits and importance do not need to line up. I need to research that later.

Would a simple linear regression model do better?

## Linear Regression 
## 
## 3428 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3084, 3086, 3086, 3084, 3085, 3086, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.7591005  0.2744523  0.5923417
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

No practically significant difference from the rpart model.

What does this model think are important variables?

Linear model variable importance

Linear model variable importance

Again a surprise as total so2 virtually disappears and alcohol is pushed right down the list.

I ran the lm without any preprocessing. But lm often runs better with standardized variables (subtract mean from each value and divide by the standard deviation). So run the same model with standardization.

## Linear Regression 
## 
## 3428 samples
##   11 predictor
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3084, 3085, 3086, 3085, 3085, 3085, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.7595102  0.2731548  0.5925265
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

No practically significant difference from the preprocessing.

Since it is linear regression, we may have an issue with multicollinearity. Run a check. It’s usual to take out columns that correlate over a cutoff of 0.70.

## [1] "Predictors over 0.70 correlation cutoff: density"

In this case that is density, which we already know correlates strongly with alcohol. Run the model again without density. Set up the new formula.

## quality ~ fixed + volatile + citric + sugar + chlorides + free.so2 + 
##     total.so2 + pH + sulphates + alcohol

Now run the model.

## Linear Regression 
## 
## 3428 samples
##   10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3085, 3085, 3085, 3085, 3084, 3086, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   0.767057  0.2584003  0.596313
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

No practically significant difference for the model performance.

Instead of choosing myself, add a Principal Component Analysis to the preprocessing. This will keep the components above a variance threshold (leave the default 0.95). The components will be selected to best capture 95% of the variance. (Density is back in).

## Linear Regression 
## 
## 3428 samples
##   11 predictor
## 
## Pre-processing: centered (11), scaled (11), principal component
##  signal extraction (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3086, 3086, 3086, 3085, 3085, 3085, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.7660879  0.2613879  0.5960826
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Again no practically significant improvement. The pca kept in all 11 components. So it sees no unnecessary explanatory variables.

Going back to the model without density, look at the variable importance.

Reduced standardized linear model variable importance

Reduced standardized linear model variable importance

Now look at what the recursive partitioning in caret determined.

Recursive partitioning model variable importance

Recursive partitioning model variable importance

The variable importance in the linear model is now closer to what the partitioning determined.

Now alcohol is back on the top of the lm list. The rest is still quite mixed and it is unclear why important variables in the tree model did not appear high in the tree decision hierarchy.

Neither model is particularly good. There are a number of other models and optimizations that could be attempted.

So let’s try a Support Vector Machine (SVM) model in caret. We’ll start by predicting the quality categories, so we switch to accuracy as a metric. TuneLength of 9 is typical, fitting 9 values in the default grid of cost parameters.

## Support Vector Machines with Linear Kernel 
## 
## 3428 samples
##   11 predictor
##    7 classes: '3', '4', '5', '6', '7', '8', '9' 
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3085, 3086, 3084, 3086, 3084, 3087, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.5079888  0.1823607
## 
## Tuning parameter 'C' was held constant at a value of 1

Accuracy is poor. We can look at the confusion matrix to see what it means.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   0   0   0   0   0   0
##          5   5  38 249 133  13   4   0
##          6   2  15 166 565 230  50   0
##          7   0   0   0   0   0   0   0
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5537          
##                  95% CI : (0.5279, 0.5794)
##     No Information Rate : 0.4748          
##     P-Value [Acc > NIR] : 8.239e-10       
##                                           
##                   Kappa : 0.2346          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000  0.00000   0.6000   0.8095   0.0000  0.00000
## Specificity          1.000000  1.00000   0.8171   0.4003   1.0000  1.00000
## Pos Pred Value            NaN      NaN   0.5633   0.5496      NaN      NaN
## Neg Pred Value       0.995238  0.96395   0.8385   0.6991   0.8347  0.96327
## Prevalence           0.004762  0.03605   0.2823   0.4748   0.1653  0.03673
## Detection Rate       0.000000  0.00000   0.1694   0.3844   0.0000  0.00000
## Detection Prevalence 0.000000  0.00000   0.3007   0.6993   0.0000  0.00000
## Balanced Accuracy    0.500000  0.50000   0.7085   0.6049   0.5000  0.50000
##                      Class: 9
## Sensitivity                NA
## Specificity                 1
## Pos Pred Value             NA
## Neg Pred Value             NA
## Prevalence                  0
## Detection Rate              0
## Detection Prevalence        0
## Balanced Accuracy          NA

As we saw with the simple tree model, predictions are limited to categories 5 and 6 and don’t extend into the other categories. That’s why the accuracy is so low.

SVM linear model variable importance

SVM linear model variable importance

Here the variable importance is given for each quality category (so X7 means quality category 7). The importance of chlorides is seen for category 7. Alcohol and free so2 are prominent across most categories.

We can try non-linear “radial” basis functions. This will attempt to fit non-linear relations. There are many variables that do not hold a linear relationship across all categories.

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3428 samples
##   11 predictor
##    7 classes: '3', '4', '5', '6', '7', '8', '9' 
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3083, 3086, 3085, 3085, 3085, 3085, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa    
##   0.25  0.5420078  0.2534430
##   0.50  0.5517200  0.2736790
##   1.00  0.5579473  0.2891171
## 
## Tuning parameter 'sigma' was held constant at a value of 0.07628414
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.07628414 and C = 1.

The accuracy seems higher. Let’s see how that impacts the confusion matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   1   0   0   0   0   0
##          5   5  40 258 125   9   1   0
##          6   2  11 156 538 168  36   0
##          7   0   1   1  35  66  17   0
##          8   0   0   0   0   0   0   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5871          
##                  95% CI : (0.5614, 0.6124)
##     No Information Rate : 0.4748          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.321           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3  Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity          0.000000 0.0188679   0.6217   0.7708  0.27160
## Specificity          1.000000 1.0000000   0.8294   0.5168  0.95599
## Pos Pred Value            NaN 1.0000000   0.5890   0.5906  0.55000
## Neg Pred Value       0.995238 0.9646018   0.8479   0.7138  0.86889
## Prevalence           0.004762 0.0360544   0.2823   0.4748  0.16531
## Detection Rate       0.000000 0.0006803   0.1755   0.3660  0.04490
## Detection Prevalence 0.000000 0.0006803   0.2980   0.6197  0.08163
## Balanced Accuracy    0.500000 0.5094340   0.7255   0.6438  0.61380
##                      Class: 8 Class: 9
## Sensitivity           0.00000       NA
## Specificity           1.00000        1
## Pos Pred Value            NaN       NA
## Neg Pred Value        0.96327       NA
## Prevalence            0.03673        0
## Detection Rate        0.00000        0
## Detection Prevalence  0.00000        0
## Balanced Accuracy     0.50000       NA

The non-linear SVM model was better than the linear model. You can see from the confusion matrix that it starts to form a more diagonal pattern, with the inclusion of predictions for category 7.

Generate the variable importance.

SVM non-linear model variable importance

SVM non-linear model variable importance

OK, that looks the same as for the previous linear model, which suggests the method for calculating the importance is independent of the model method.

Finally, to compare with the other models, let’s run the non-linear SVM on the quality column - not as categories but as a continuous variable. We can obtain the RMSE value and compare to the other models.

## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3428 samples
##   11 predictor
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 3085, 3086, 3086, 3085, 3086, 3084, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE       Rsquared   MAE      
##   0.25  0.7194641  0.3498053  0.5455965
##   0.50  0.7140075  0.3594573  0.5398859
##   1.00  0.7081665  0.3699813  0.5351721
## 
## Tuning parameter 'sigma' was held constant at a value of 0.07834557
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.07834557 and C = 1.

Generate a summary to look at the range of the predicted values, which was an issue with the early rpart models.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.301   5.400   5.855   5.830   6.231   7.370

Whereas the first rpart model predictions did not extend beyond categories 5 and 6, the new model is now extending into the quality categories 4 and 7.

Look at the variable importance.

Continuous SVM non-linear model variable importance

Continuous SVM non-linear model variable importance

Now compare the 6 caret models we created for predicting the quality as a continuous variable.

Models with continuous quality as target:

RPTC rpart

LM linear regression

LMS linear regression standardized

LMSR linear regression standardized reduced (no density)

LMSP linear regression standardised principal component analysis

SVRQ non-linear SVM (radial)

## 
## Call:
## summary.resamples(object = results_continuous)
## 
## Models: RPT, LM, LMS, LMSR, LMSP, SVRQ 
## Number of resamples: 30 
## 
## MAE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RPT  0.5494183 0.5762993 0.5891019 0.5937972 0.6094715 0.6479804    0
## LM   0.5499537 0.5704551 0.5935185 0.5923417 0.6065380 0.6439286    0
## LMS  0.5537672 0.5773168 0.5917375 0.5925265 0.6072782 0.6299780    0
## LMSR 0.5593763 0.5857704 0.5963897 0.5963130 0.6117304 0.6342780    0
## LMSP 0.5485558 0.5815662 0.5916593 0.5960826 0.6101454 0.6518739    0
## SVRQ 0.4981283 0.5163948 0.5343199 0.5351721 0.5508056 0.5948575    0
## 
## RMSE 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RPT  0.6873955 0.7360141 0.7526879 0.7531557 0.7755424 0.8110713    0
## LM   0.6996858 0.7296200 0.7595643 0.7591005 0.7850522 0.8428440    0
## LMS  0.7030907 0.7375116 0.7634454 0.7595102 0.7769783 0.8444220    0
## LMSR 0.7042386 0.7545738 0.7658895 0.7670570 0.7850160 0.8267510    0
## LMSP 0.6969380 0.7482190 0.7587367 0.7660879 0.7884206 0.8315144    0
## SVRQ 0.6545835 0.6855503 0.7056025 0.7081665 0.7302604 0.7770454    0
## 
## Rsquared 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RPT  0.2002407 0.2605949 0.2904932 0.2908492 0.3112415 0.3735017    0
## LM   0.1843671 0.2467907 0.2701046 0.2744523 0.2944929 0.3604742    0
## LMS  0.1971150 0.2556266 0.2678179 0.2731548 0.2938054 0.3608088    0
## LMSR 0.2103202 0.2348679 0.2513786 0.2584003 0.2833495 0.3458324    0
## LMSP 0.1609149 0.2400387 0.2594913 0.2613879 0.2774192 0.3522588    0
## SVRQ 0.3113475 0.3379534 0.3733088 0.3699813 0.3943314 0.4472066    0

Looks like the SVM model was significantly better on all counts. Visualize the values with confidence intervals.

Continuous model metric comparison

Continuous model metric comparison

No question that the SVM model is the best so far.

Finally, let’s compare the 2 caret models we created for the quality as a categorical variable.

Models with categorical quality as target:

SVML linear SVM

SVMR non-linear SVM (radial)

## 
## Call:
## summary.resamples(object = results_categorical)
## 
## Models: SVML, SVMR 
## Number of resamples: 30 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVML 0.4635569 0.4959891 0.5094625 0.5079888 0.5171666 0.5584795    0
## SVMR 0.5218659 0.5423360 0.5583090 0.5579473 0.5724845 0.6064140    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## SVML 0.1062522 0.1596202 0.1883400 0.1823607 0.1985738 0.2640654    0
## SVMR 0.2246130 0.2658121 0.2912582 0.2891171 0.3085541 0.3687203    0

The non-linear model performs better than the linear model. Visualize the values with confidence intervals.

Categorical model metric comparison

Categorical model metric comparison

Again we see visually that for the categorical models, the non-linear model performs better than the linear model.

Multivariate Analysis

Relationships observed and features that strengthened each other

Changes in density seem to be mainly driven by changes in alcohol levels rather than changes in residual sugar levels. We see the relationship between alcohol and residual sugar gradually decreasing as the density increases. So sugar doesn’t affect density as much as alcohol.

There was a tendency for lower total so2 and lower fixed acidity to relate to higher quality.

Low chlorides (experienced as saltiness) relate to higher quality.

Interesting or surprising interactions between features

Poor quality is common for all residual sugar levels if the alcohol is low enough. We don’t see high alcohol and high residual sugar combinations.

So there are ‘dry’ wines with low residual sugar AND low alcohol. So either the alcohol or the residual sugar was removed from the wine, perhaps? Either way, these combinations are rated poorly, so why would one make that effort? Or the fermentation quickly ran out of sugar and stopped. Perhaps this is a consequence of trying to create a dry wine by starting with less sugar and overdoing it. The wine doesn’t ferment enough alcohol to get a better rating.

Moderately sweet wines can get a good rating if there is enough alcohol. There are a few good wines with low alcohol and moderate residual sugar, but these are the exception. In general, better ratings correspond to more alcohol content.

Models created and their strengths and limitations

I started with a simple rpart partitioning model and then looked at improving this via models using the caret package. In addition to a new cross-validated rpart model, I looked at linear models with standardized variables and with a high correlating variable (density) removed.

The caret rpart model (RMSE 0.75, MAE 0.59) was slightly better than the best linear model (RMSE 0.77, MAE 0.60).

I then generated Support Vector Machine models in two forms: a categorical prediction and a continuous variable prediction. For the categorical prediction I tried a linear and a non-linear model.

The SVM using continuous variable performed better than the rpart and linear models. I achieved an RMSE of 0.71 and an MAE of 0.54.

For the categorical prediction, the non-linear (radial) SVM performed better than the linear SVM.

After the models were generated I compared the performance metrics of my models with those of Cortez et al. in their paper.

They ran SVM on quality as a continuous variable. They achieved an accuracy (tolerance=0.5 to be compatible with my model) of 64.6 with SVM, compared to my non-linear SVM of 58.6. Their Kappa was 43.9 compared to my 32.0. They reported 0.45 MAE (they name it MAD) compared to my 0.54.

So their results were better.

The paper states that the data comes from Vinho Verde, a Portugese wine. According to the Wikipedia entry on Vinho Verde, “The white Vinho Verde is very fresh, due to its natural acidity, with fruity and floral aromas that depend on the grape variety. The white wines are lemon- or straw-coloured, around 8.5 to 11% alcohol, and are made from local grape varieties Loureiro, Arinto, Trajadura, Avesso, and Azal. Vinho Alvarinho is made from Alvarinho grapes, from a small designated subregion of Monção and Melgaço. It has more alcohol (11.5 to 14%) and ripe tropical aromas.”

The distribution of Alcohol values might therefore be overlaying two distributions, one for Vinho Verde “classic” with a range of 8.5% to 11% and one for Vinho Alvarinho with a range of 11% to 14%. But this is speculation and there is no data to distinguish these in the dataset.


Final Plots and Summary

Plot One

Relationship between Alcohol, Residual Sugar and Quality in White Wine

Relationship between Alcohol, Residual Sugar and Quality in White Wine

Description One

I was surprised to see so many data points at low levels of sugar and alcohol The quality was rated poorly for these combinations. Poor quality is common for all residual sugar levels if the alcohol is low enough and there are very few high alcohol and high residual sugar combinations.

So there are ‘dry’ wines with low residual sugar AND low alcohol. So either the alcohol or the residual sugar was removed from the wine, perhaps? Either way, these combinations are rated poorly, so why would one make that effort? Or the fermentation quickly ran out of sugar and stopped. Perhaps this is a consequence of trying to create a dry wine by starting with less sugar and overdoing it. The wine doesn’t ferment enough alcohol to get a better rating.

Moderately sweet wines can get a good rating if there is enough alcohol. There are a few good wines with low alcohol and moderate residual sugar, but these are the exception. In general, better ratings correspond to more alcohol content.

Plot Two

Recursive partitioning and regression tree using 'rpart' package

Recursive partitioning and regression tree using ‘rpart’ package

Description Two

The first model made using the recursive partitioning package rpart showed the importance of alcohol in that it determined the first split. The other variables were somewhat more surprising because visual analysis of the scatterplots seemed to show fixed acidity (rather than volatile acidity) and total so2 (rather than free so2) would lead to a clearer partition. Also, chlorides, which looked like a good candidate for partitioning did not appear.

The reason is likely to be that the visual analysis sees both sides of the tree at once. If the left and right sides were to be split in the data into two sets and then visualized, they might then show the subsequent splits more clearly.

Plot Three

Most important variables from recursive partitioning in 'caret' package

Most important variables from recursive partitioning in ‘caret’ package

Description Three

The rpart model in caret generated slightly better performance metrics than the model from the rpart package. Looking at the the relative importance of the variables, it was surprising to see that volatile acidity was much less important in this model.

The importance of the variables in recursive partitioning is estimated by summing the reduction in the loss function attributed to each variable at each split, including the top competing variables. Perhaps the reason for this difference is that the caretmodel went a number of levels deeper which influenced the cumulative loss reduction contribution.

An interesting aspect of this model is that chlorides become much more important, which was a finding from the visual analysis.

The final surprise occurred when reading the variable importance in the paper from Cohen et al. In their list. sulphates were the most important variable, whereas in this model, and in the visual analyses, sulphates was not considered important at all.


Reflection

I was pleased with the way the plots worked out. Two of my three final plots were generated by packages and I couldn’t find any way to customize them further. But I added captions. I used them because they were key results from this analysis.

The mean_box helper function was supposed to be a time-saver but I had difficulties passing the arguments into the groupby and aesthetics and there is probably a more elegant way of doing that. I burnt more time than I saved, but I learned a lot. I will use this function again.

I struggled with getting the right mixture of granularity and color for the multivariate plots.

I need to improve my knowledge of further optimizing models.

I did not achieve the model performance reported by Cohen et al. in their paper, but they conducted an exhaustive optimization of the hyperparameters and of the selected features, that I didn’t have time for (and that would have made this report much longer!). Future work could include such optimization and models created with other packages than caret.

The random sampling method that I used left some categories with low sample counts because I withheld 30% for a test set. But the caret package runs cross-validations to assess its model performance. So it might be possible to use the whole data to create models and see how caret assesses them. There would be no way to verify them so there is a risk of overfitting.

I have not used caret before and was surprised how quickly I could get to useful models. I would like to better understand why there are so many variations in the lists of important variables depending on the model. E.g. Cohen et al. reported sulphates as their most important variable, which never came up in my analysis.

It was helpful not to read ahead or research the topic in advance, as it did not bias my analysis. There have been a lot of analyses with this dataset. Future work could include a meta-analysis of the work on this dataset to identify gaps and new approaches. I did get some help from reference [5] which had some example rpart package models with this same dataset.

Now I’m looking forward to a glass of wine…

References

[1] https://stackoverflow.com/questions/8317231/elegant-way-to-report-missing-values-in-a-data-frame

[2] http://www.justintools.com/unit-conversion/density.php?k1=milligrams-per-cubic-decimeter&k2=parts-per-million

[3] https://stackoverflow.com/questions/28427572/manipulating-axis-titles-in-ggpairs-ggally

[4] https://stats.stackexchange.com/questions/26920/variable-ordering-using-pca

[5] Book: “Machine Learning with R” by Brett Lanz ISBN 978-1-78216-214-8

[6] https://stackoverflow.com/questions/38220963/how-to-pass-a-variable-name-in-group-b

[7] https://stackoverflow.com/questions/41783396/passing-arguments-to-dplyr-summarize-function

[8] https://stackoverflow.com/questions/44141193/apply-jittering-to-outliers-data-in-a-boxplot-with-ggplot2